pmc-at - Software Development

Update on Jan 07, 2015
What is the meaning 'non-standard residues' ?
Answered by PL: In the pdb file, you will see a column with notations like "GLY, GLN, LYS..." et. those are standard amino acid residue names. these residues are automatically recognized because their information is available already. However, some reisude names such as "ALY" etc. are not in the exisiting data library, and can not be recognized by the software automatically are called non-standard residues and we need to provide data information.

AS: Sir I could not understand:-
convert NAD to NDP in NAD+walker.lib, ACE and NME in $AMBER/dat/leap/prep/amino12.in
Answered by PL: The residue NAD is not a standard residue, therefore we need to provide the corresponding data for it. the topology of the residue is in file NAD+walker.lib but using different name called NDP. therefore we need to convert the NAD residue into NDP so we can use the data in the NAD+walker.lib (we need to convert the atom names to the correspoding ones as well.)

The atom names of ACE in the pdb file doesn't match with the available data set which is specified in file $AMBER/dat/leap/prep/amino12.in, therefore we need to change the corresponding atom names in the pdb file

The residue NMA in the pdb file corresponds to the residue NME in the $AMBER/dat/leap/prep/amino12.in, and we need to convert the residue name NMA to NME and the correpsonding atom names

What is internal coordinates(Z-matrix) of a molecule?
The Z-matrix is a way to represent a system built of atoms. A Z-matrix is also known as an internal coordinate representation. It provides a description of each atom in a molecule in terms of its atomic number, bond length,bond angle, and dihedral angle, the so-called internal coordinates.

Update on Dec 24, 2014
What are the input file of MMPBSA.py file?

mmgbsa.in
4BV3_CH_3_Top1.1.amber_score
Complex topology file: 4BVH_A_SIRT3_OAcADPR_1water_init.4BV3_CH_3_Top1.1.prmtop
Receptor topology file: 4BVH_A_SIRT3_OAcADPR_1water_init.prmtop
Ligand topology file: 4BV3_CH_3_Top1.1.prmtop
Initial mdcrd(s): 4BVH_A_SIRT3_OAcADPR_1water_init.4BV3_CH_3_Top1.1.inpcrd

How to individualy run MMPBSA.py?MMPBSA.py -O -i mmgbsa.in -o 4BV3_CH_3_Top1.1.amber_score -cp 4BVH_A_SIRT3_OAcADPR_1water_init.4BV3_CH_3_Top1.1.prmtop -rp 4BVH_A_SIRT3_OAcADPR_1water_init.prmtop -lp 4BV3_CH_3_Top1.1.prmtop -y 4BVH_A_SIRT3_OAcADPR_1water_init.4BV3_CH_3_Top1.1.inpcrd

The final output file is:-4BV3_CH_3_Top1.1.amber_score
Update on Dec 10, 2014
AS: How to execute mm-gbsa script?./prepare_amber.pl test.mo12 test.pdb orperl prepare_amber.pl test.mo12 test.pdbe.g.- ./prepare_amber.pl 4BV3_CH_3_Top1.mol2 4BVH_A_SIRT3_OAcADPR_1water_init.pdb

prepare_amber.pl file takes two argument one is .mo12 (e.g. 4BV3_CH_3_Top1.mol2) and .pdb file(e.g. 4BVH_A_SIRT3_OAcADPR_1water_init.pdb)SECTION:1: generates receptor files; calls amberize_receptor:-To generates receptor files, calls amberize_receptor and send 4BVH_A_SIRT3_OAcADPR_1water_init.pdb as an argument.

amberize_receptor:-
Reads the PDB file for the receptor; adds AMBER force field atom types and charges. Generates Amber parameter/topology (prmtop) and coordinate (inpcrd) files.
This amberize_receptor invokes program tleap.

SECTION:2: adds the AMBER_SCORE_ID to each MOLECULE in the input mol2
Generates a modified mol2 file with suffix amber_score.mol2 (e.g.,4BV3_CH_3_Top1.amber_score.mol2) that appends to each TRIPOS MOLECULE a TRIPOS AMBER_SCORE_ID section containing a unique label (e.g., 4BV3_CH_3_Top1.1). This modified mol2 file should be assigned to the ligand_atom_file input parameter in the DOCK input file.
SECTION:3: Splits the ligand_mol2_file into separate mol2 files, one file per ligand, where the file prefixes are the AMBER_SCORE_ID unique labels( In this example AMBER_SCORE_ID=1 ). These mol2 files are used in the next step.

SECTION:4: generate files for ligand and complex; call amberize_ligand:Runs the antechamber program to determine the semi-empirical charges, such as AM1-BCC, for each ligand. Runs the tleap program to create parameter/topology and coordinate files for each ligand using the GAFF force field (prmtop, frcmod, and inpcrd), and writes a mol2 file for the ligands with GAFF atom types (Here 4BV3_CH_3_Top1.1.gaff.mol2). This is performed via the script amberize_ligand.amberize_complex:- Combines each ligand with the receptor to generate the parameter/topology and coordinate files for each complex.

The script prepare_amber.pl also creates files with extensions .log and .out that contain the detailed results of the above steps. The amberize_*.out files can be browsed to verify that the preparation was successful. If prepare_amber.pl emits any warnings or errors then all these files should be scrutinized. Some errors are fatal, such as an inability to prepare the receptor; generally, and in this case especially start investigating by examining file amberize_receptor.out.

PL(12/2/2014): Update on MM-GBSA script:
The script is available under user app's directory ~/bin.
The perl script prepare_amber.pl is the main program, and three sh script amberize_complex, amberize_ligand, amberize_receptor and the link MMPBSA.py to the actual MMPBSA.py code are required.
In order to run the script, proper setting for AmberTools is required. Check app's .bashrc file for the configuration.
A test case is available under ~/bin/test_mmgbsa.
To run the test case, execute the command "prepare_amber.pl -i 4BV3_CH_3_Top1.mol2 4BVH_A_SIRT3_OAcADPR_1water_init.pdb".

Update on Nov 07, 2014
AS: I have download flowchart of mcsad_protein.py and studying it.
Is mcsad stand for Monte Carlo simulations of atom displacements?
RC: It stands for Monte Carlo simulated annealing, discrete

RC (10-31): Have you had a chance to review mcsad_protein.py?
You previously asked about the i/o for this code. Please let us know if you are able to find some of log files generated by the python protein design code that can be used to understand how the code should properly run. It submits

to the batch server and then reads output from executables at each step. There are input files in the project directory (described below).
Also please let us know when you plan to review the MM-GBSA scripts.
RC: Have you been able to test mcsad_protein.py yet (using protein_design.py executable) on the batch server? I assume you have previously tested batch server

submission (e.g., of a matlab code) with user rajchak as well as ashee and it is now working properly. If so, you should proceed to test the protein design code.
Have you started reviewing the MM-GBSA scripts yet? Please specify when you will be ready to test them.

AS(13/11): When I executed using the command "python protein_design.py"It showing "ImportError: No module named numpy"I think this is the problem of python version(2.4.3) at brahman. We should install latest python version(3.3). Many other function also will not support by python 2.4.3 version.
RC: Yes it is an issue with python. One possible way to deal with this without having to install new software on brahman is to install the python updates on, say slave003, and run the executable from there rather than brahman. In order to make sure the pbs

are submitted through brahman (the pbs server), the qsub commands like
os.system("qsub job_grid_c%i-i%i.ds" %(k,j))
can be edited to use ssh submission of the

on brahman (ssh brahman "qsub...") as indicated under the pbs section of the sysadmin page.
You should first obtain the path to the MM-GBSA scripts from PL and read them.
(See below regarding the ssl certification issue on brahman.)

PL(11/19/2014): On slave003, there are multiple copies of python available. python 2.4.3 is available under /usr/bin/python; python 2.7.6 is available under /usr/local/bin/python; and python 2.7.8 is available under /usr/bin/python2.7. Since many packages is not available directly for python 2.7 from yum, I have installed python27-devel, python27-libs, python27-tools,python27-pip and gcc44, then using pip2.7 to install numpy-1.9.1.
In case python 2.4.3 is not working, and numpy is needed, /usr/bin/python2.7 can be used now. (simply put "alias python=/usr/bin/python2.7" in the .bashrc should work.)
If the same issue exist for other nodes, this solution can be applied.

AS(14/11): I can't find out the path of MM-GBSA script.
I tried using the command "grep -inr "gbsa""Sir, Could you please provide the path of MM-GBSA script?
RC: Has PL been in contact with you? He was supposed to tell you about these scripts some time back. You should schedule a meeting with him asap.
When I tried to install numpy using the command "yum install python-numpy"
I got the result below:-
"up2date_client.up2dateErrors.SSLCertificateVerifyFailedError: The SSL certificate failed verification."

RC: Under the yum section of the sysadmin page, we indicated before that there were ssl certification issues on brahman on the sysadmin page and that certain rpm packages may need to be installed. We had asked you to read that section and post your understanding below it a couple of weeks ago. We can give more input thereafter.

Update on Oct 30, 2014
distribution_fns.py:- This file has two function. compareStates() function takes two list as a argument and return true if last value of fist list is grater than the last value of second list return false if not.

filter() function takes two list as a argument, it return the list if last element is garter than from the second argument(cutoff)(confuse because why loop has been used).

Update on Oct 22, 2014
mcsadoptions_protein.py:- This file needed to create a dictionary and store values to the variables(Some variable meaning could not understand). This file mainly initialize variable with values.

array_fns.py:- This file needed to create list using Regular Expression.

scorefn_protein.py:- Read from the log files and generate list using the function of array_fns.p. Calculate E, energy and affinity(Will read why and where these variable needed).

hardparams.py:- read dist_constraints.txt file and then serializing Python object structure using pickle for future use.

Sir, What is plop energy?
RC: It is a protein energy that is computed using the plop executable, which we use for protein modeling.

Python computational chemistry software development

RC (10-14): The following information was copied from a previous academic group wikispaces site. The Python-MM-GBSA and Sirtuin-MM-GBSA pages of that site have background ppt
slides, reports and notes on how the python code below can be used for protein engineering. PL should review those pages and copy relevant information including ppts here or to dropbox as appropriate
so current team members can access them. The introductory ppt may be useful to AS. The python code below may also serve as a starting point for python-based drug design script development at PMC-AT.

The following notes summarize prior work on the python protein design code. The source code is in brahman.pmc-research.com:/home/rajchak/Simulation/Protein/mcsapython/discretemcsa.
The code is run from a project directory that is specific to the application - currently brahman.pmc-research.com:/home/rajchak/Simulation/Protein/r61_peptide_mcmc- using the python script
protein_design.py.
The code is set up to run on a batch server. The cluster environment has changed since 2012 and hence the code must be tested as a first step.

This code could (with modifications) also be used for other applications like automated drug design workflows.

5-31-2012

Glide, Maestro, as well as the entire suite of Schrodinger products are installed on the cluster:
None of the programs, however, will work without the license file.
The license file must be updated.

5-25-2012

The following changes to the protein design code were made and committed to the repository:

The bug pertaining to the computation of delta_E due to improper storage of the score of the last state in each chain has been fixed.
Temporary files corresponding to plop jobs in the project directory are now erased upon execution of a new design run.
The script protein_design.py has been modified so it can be executed from the project directory.
Comments have been provided next to the definitions of the important arrays. I confirmed that energies, sequences for accepted moves at the current temperature only are stored in mcsad_data.p. We can later modify the code so it stores data at all Ts, but it's not imp now since we are not doing annealing.

Also, the i-1.out.pdb files have been modified so they all correspond to the energy minimized wild-type protein. We can at any time copy "sample" files in the r61_peptide_files directory over these so the chains start from different structures.

5-24-2012

1) "No such file" in the log file simply indicates it tried to delete a file but the file was not there. This is a normal message from the error handler; since some chains will have finished and are waiting for other chains, they will not have corresponding files to delete on that iteration of the loop.

2) scorefn_protein.py assigns a score of 500000 if the energy score from plop is too high (>-10000) or unreported, e.g. due to steric clashes. It appears from browsing your log file that this was happening a lot
for your chains - because the same energy was assigned to all such structures, the delta E in these cases is 0. I did not usually encounter this. Can you tell me what were the starting structures (pdb files) for your chains?
The starting structures are the same chains that were the original ones from you for r61 in the directory:
/home/rajchak/Simulation/Protein/r61_peptide_mcmc/lig_cx-i-1.out.pdb
where x is the chain number from 0 to 5
Also, could you run the code again from different initial structures and save the log files from each run, so I can compare them.
The various starting chain structures (files mentioned above) were created from the PDB crystal coordinates for R61, and you simply mutated a few of the residues at positions where mutations are allowed for optimization. I assume you used a some program, like Maestro (which one did you use?) to mutate those residues. The new side chain atoms and coordinates are filled in (are they also energy minimized?). The various starting chains had identical coordinates for atoms from residues that were not mutated (the majority), thus it seems that each structure was not minimized. Now, when you ask to run the code for different initial structures, do you mean different mutants, or do you mean different starting coordinates from a mutant + a brief minimization?
Large jumps in E can occur due to steric clashes as well. This is one reason total protein energy is not the best scoring function for design without some softening of the potentials. To deal with this temporarily during our speed tests, I set the MC temperature very high, so uphill moves are often accepted.

3) Can you repeat the most important comments from your last test of this code with Brad. I remember you guys mentioned that the pickle file did not have appear to have enough structures stored for each chain. I believe this "discrepancy" might be due to the fact that I have chosen not to save the structures corresponding to rejected moves - only accepted moves are stored as states of the chain.

Here are some previous questions/comments we had from April 20, 2012:

How long does it take the code to converge at one temperature for R61 protein? We believe this means "how long to achieve stationarity for each temperature", but in the R61 case, we are holding temperature constant. Our plan was to run R61 protein, get in the Pickle file, and plot energy scores vs. time. Look for how long it take to reach a stationary state.
We tried to plot the data from mcsad_data.p. But, we do not understand the data structure. For example, there are several keys corresponding to energy terms: allEnergyList, E_traj, energy_traj, and E_traj_sample. Using " j " for the index of the inner loop (equilibrium sampling), we expected that the values for mcsad_data[ 'E_traj' ][ j ] = [energy for chain 1, energy for chain 2, ..., energy for chain 6], for the j'th iteration for the simulated annealing at constant temperature. But, we noticed that i only went up from 0 to 5, but in the output files, there were some files with 25+ iterations, e.g. sample_c4-i26.out.pdb, sample_c5-i25.out.pdb ... . We think this may be because mcsad_data only has the iterations from the last run, or it is not updating properly/is not being overwritten.
A few questions:
- Which energies are the correct ones; that is, the ones that correspond to equilibrium sampling at a constant temperature?
- What does " i " correspond to, if not the iteration step? And to look for convergence, we are looking for a stationary distribution of the energy that is output in mcsad_data['E_traj'][ j ] after " j " iterations. Are we mixing up the i-loop iterations and the j-loop iterations? Nonetheless, we are confused as to why there would only be 5 iterations - the j-loop is executed many, many times, and the previous output files had labels indicating 25+ iterations, as stated before. So, we do not know what these numbers correspond to.

5-22-2012

Eric,

Here are the files I am uploading here on the wiki:

python-code Brad Ridder 5-22-2012.zip
optimize2YCK Brad Ridder 5-22-2012.zip

LamDelosmeTemperatureSchedule

This is an adaptive temperature schedule for MCSA, that can (allegedly) give much better performance. However, it is quite complicated and I was not able to get it fully up and running. Papers on it can be found just by searching for Jimmy Lam and Delosme on Google Scholar.

optimizeDockParameters

Eponymous code for finding good dock parameter sets.
To use it:

1. make a folder. I called mine optimize<pdbcode>, e.g. optimize2YCK, or optimize2H4F.
2. in that folder make a folder called "bestStorage" and one called "sandbox".
3. Edit all of the paths in optimizeDockOptions.cfg to work with the paths on your computer.
4. Follow the first tutorial on the UCSF DOCK website for preparing the receptor and ligand in CHIMERA.
5. Put those resulting files into sandbox.
6. Run dockPrePrep.prep(configFilePath)
7. wait a while.
8. Go into sampleNewAnchorAndGrow.py. Look at your ligand. Change
ANGDict['min_anchor_size'] to be equal to the smallest rigid subunit of your ligand.
9. In PyLab (or whatever python interface makes you happy), type:

cd <the right directory for optimizeDockParameters>
import optimizeDockParameters as ODP
ODP.optimizeDockParameters(configFilePath,numTrials)

numTrials is an integer.

10. Press enter and let'er rip!
11. go to bed.
12. Come back the next day and observe shiny new parameters in "bestStorage" folder.

This might seem laborious, but its actually pretty simple.

rotamer

rotamer.py contains the rotamer class, for doing sampling of rotamer configurations from the Dunbrack Rotamer Library.
Just change self.LibraryPath to match the Dunbrack path on brahma. I've actually demo'ed this code before in a presentation, which I am pretty sure is on DropBox, so I won't reiterate that here.

timekeeper

timekeeper is already pretty well-documented. Also, I do not remember, but I think the version I am uploading here might be out-of-date with the one in the brahma cvs repository. I don't remember though ;(

writePLOPLists

This has already been explained as well, and I think a more recent version has been committed to the CVS repository as well, but I'll include it here just in case it hasn't.

optimize2YCK

This is the output from 50 iterations of optimizeDockParameters on the tetrahydrofolic acid ligand. The fit isn't too good (2.2 RMSD).

-Brad

4-27-2012

Update: Debugging is done. Letting the code simmer for 50 iterations.

-Brad

Hi Dr. Raj and Eric,

I docked tetrahydrofolate (TAHF) into 2YCK today with the initial successful parameter set for the NAD+ runs. The results were lukewarm; about 1.5 angstrom RMSD.

I spent the day today reworking the DOCK parameter optimization code. It is simmering now with only 2 iterations (to ensure it works, for debugging purposes). Once it has passed all debugging, then I'll let it run for the night hunting for a new parameter set for this 2YCK-TAHF system.

All the best friends,

-Brad

4-26-2012

Hi Dr. Raj and Eric,

Worked on more code exploration/refactoring today. Changes have been reflected in the CVS repository. I am not sure to what level of detail is necessary when submitting repository changes. I suspect I am not being precise enough, so I will try to make more frequent commits from now on, so that we have more incremental changes between versions instead of huge tectonic shifts.

I think I'm going to read some papers now.

All the best friends,

-Brad

4-24-2012

Hi Dr. Raj and Eric,

Just more tinkering away with the code.

Just as a note to myself, I refactored hardparams.py, but my changes will need to be propagated over into hardconstraint_protein.py.

-Brad

4-23-2012

Hi Dr. Raj and Eric,

So here is a status update.

So it appears the protein code hanged around midnight on Tuesday night. I Ctrl+C'ed to stop it, and looked at some of the data. I am not sure why it hanged, but maybe 50 iterations was too much. I ran it again today with 10 iterations, and it finished just now without hanging.

However, I do not think the results will be particularly worthy of analysis. While the code was running, I went spelunking around some of the code, and found a few bugs in the convergence.py file that, from a cursory inspection, appeared to over-estimate the Gelman-Rubin R-statistic. I made a replacement for it, computeGelmanRubinR.py, that avoids these bugs, and is a little easier to read (I used numpy array functions instead of loops). I committed this to the CVS repo.

For a random set of 5 chains of 10 samples of normal deviates, convergence.py got ~1.7 (above the rejection threshold), while computeGelmanRubinR.py got ~1.0. One would expect R to be below the rejection threshold, since these samples all came from the same standard normal distribution whose parameters were constant with sampling.

All the best friends,

-Brad

4-22-2012

Hi Dr. Raj and Eric,

I made a slight goof on Saturday. I checked the protein code results - and alas, I forgot to increase the maxTotalIter, so it only ran for a whole 3 iterations ;(.

I have increased maxTotalIter to 50, and the code is still running.

I busied myself today by learning how to plot stuff in pylab, and I also made a loadOptions.py script for loading in all of the same options in mcsadoptions_protein.py in from an easy-to-alter text file.

All the best Dr. Raj,

-Brad

4-20-2012

How long does it take the code to converge at one temperature for R61 protein?
- We believe this means "how long to achieve stationarity for each temperature", but in the R61 case, we are holding temperature constant.
- Run R61 protein. Get in the Pickle file, and plot energy scores vs. time. Look for how long it take to reach a stationary state.

How many sequences were produced?
- We have only let the code run for a short amount of time. We presume this number depends on how long the code is allowed to run. We can let it run for a longer period of time, and then analyze the number of sequences.

Running the code with multiple proteins - which ones?
- Besides R61, what does Raj suggest?
- Maybe things from the PNAS papers?

Compare w/ and w/o catalytic constraints.
- Currently using the catalytic constraints already in the source code for R61.
- We will run the code without these constraints.

How much time will it take to do the timing tests once we have GLIDE?
- We are getting to a better understanding of the code, but right now we can't answer this question.

4-19-2012

timekeeper has been CVS'ed onto brahma. Use "help timekeeper" to see updated documentation for timekeeper.py.
Later today I will implement timekeeper and mcsad_protein.py and observe the output.

-Brad

4-18-2012 (later today...!)

I have fixed up timekeeper. Now it works the way I like it, and I don't think it has any bugs.
I will commit it to the repository once I have the documentation sorted out.

-Brad

4-18-2012

I wrote a new Python class called timekeeper, and it is in (...wait for it...!) timekeeper.py.
We can use this to manage all of our timing tests.

Here's how it works.

myTimer = timekeeper.timekeeper(writePath) #writePath is where to write the timing data files to.

myTimer.startTimeTrial(trialName) #Give the time trial a name. For example, a string could be formatted at each equilibrium loop iteration to say something like "MCSAD Equilibrium Chain k Iteration j".

Then, when you want to stop that timing, you would write:
myTimer.stopTimeTrial(trialName)

To write data to output.
myTimer.writeTimeData()

I foresee two obvious contingencies:
1. Stopping a timer that wasn't started.
2. Failing to stop a timer that was started.

#1 will be handled by writing a warning to the data log, stating the trialName in question, and the time this occurred since the timekeeper class was instantiated.
#2, if a timer is not stopped when writeTimeData() is called, is to write a warning to the output file with the trial name, and when this timer was started since the timekeeper class was instantiated.

I might do a little rearranging - e.g. I'll have writeTimeData solely write data to the file, and make a new method called "concludeTiming()" that will check for errors and write warnings.

-Brad

4-16-2012

Hi guys,

I made a flowchart of the mcsad_protein.py code file. I did this for two good reasons:

1. To teach and understand the code to myself.
2. To make it easier to organize our timing tests.

By looking at the diagram, it should be easy to point out where we want timing to be done.

The pdf file is here:

MCSAD Python Code Timing Test Layout Diagram Brad Ridder 2-12-2012.pdf

-Brad

4/12/2012

Are we comfortable running and analyzing the code?
- Eric and I are OK with running the code, but the output is still a little fuzzy. We are not entirely clear on which output files are which.
- We will email you a list of our remaining questions as we do more running and analysis.

How long does it take the code to converge at one temperature for R61 protein?
- We haven't paid attention to that; we will look into that.

How many sequences were produced?
- We have only let the code run for a short amount of time. We presume this number depends on how long the code is allowed to run. We can let it run for a longer period of time, and then analyze the number of sequences.

Running the code with multiple proteins - which ones?
- Besides R61, what does Raj suggest?
- Maybe things from the PNAS papers?

Compare w/ and w/o catalytic constraints.
- Currently using the catalytic constraints already in the source code for R61.
- We will run the code without these constraints.

How much time will it take to do the timing tests once we have GLIDE?
- We are getting to a better understanding of the code, but right now we can't answer this question.

4/11/2012

Raj's preliminary comments

- My comments here are meant to keep us focused on the overall task at hand - approximate assessment of scalability of the MC protein design algorithm for two different scoring modes

- Automating the analysis of favorable interactions with command line methods could be useful down the line. This is really
a generalization of the methods by which the program now identifies critical interatomic distances relevant to catalysis.
My initial impression is that it would be overkill to generalize this for the purposes of the current test.

- As noted we will need to have a discussion on this prior to timing implementation, and I may play into the coding on that as well. I'm fine with using timeit, the changes would be minimal.
Regarding the different approaches to timing, bear in mind that we will be computing statistics on the time spent on the plop and glide scoring steps, and we are not terribly concerned with the precise number of milliseconds required for a given calculation.
Let me know once you've finished getting familiarized with the workflow of the current code before we proceed.

- Eric, please provide your sirtuin update when you get a chance, prior to further detailed analysis of timing.

Raj

4/11/2012

Meeting with Eric #4 on VMD

Downloaded and installed VMD
Learned how to pscp from cluster to Windows machine
- C:\Users\Abhay\Desktop>pscp.exe -r -v [email protected]:/home/rajchak/Simulation/Protein/r61_peptide_mcmc/r61_peptide_files .

VMD selection commands
- Learned how to load molecules in.
- Selection commands:
  - H-bond viewing command:
    - within 4 of (sidechain and resid 120 123 161 217 233 285 299 301 326 327)
  - resid is the correct term for referring to the PDB residue number.
  - Viewing specific side-chains commands
    - sidechain and resid 120 123 161 217 233 285 299 301 326 327

We visualized a number of iterations from chain #0 of the r61 protein, such as lig_c0-i0.out.pdb and sample_c0-i4.out.pdb.
VMD cannot load the trajectory of the MCSAD chains as a VMD trajectory, because the number of atoms is changing from frame to frame.
- Every iteration of the trajectory must be loaded as a separate molecule. This requires extra steps for viewing the trajectory.
- We could write a VMD script to handle this visualization task, but not at this time.
- We discussed how to visualize the favorable interactions we are looking for, e.g. H-bonds, Coulombic, or VDW interactions that stabilize the protein.

We imagine using a script to count the number of H-bonds to the mutated sidechains, which may change for every iteration, and could vary between Monte-Carlo chains.
- Our idea is to use a series of data analysis scripts to do various things, e.g. count H-bonds, in order to red-flag important data from the MCSAD output. Then, we can personally view it in VMD to look for more specific details.

Timing the algorithm
- Brad changed Python time.clock() to time.time(), which we believe is a better indicator of wall clock time
- Eric discussed some issues with how wall-clock time can be deceptive, especially when working on a shared cluster.
- Our cluster is not shared, so wall clock time should be okay to go by.
- time.clock() appears to measure CPU time - how much "oomph" the CPU has expended on the particular job - possibly not including I/O, system calls, or other things.
- We've decided on using time.time() for timing the overall program.
- For timing the sections of the program i.e. GLIDE calls or DOCK calls, timing for iterations of the different MC chains, etc., we need to use a more sophisticated timing module, such as the Python module timeit. Further investigation is required.

4/10/2012

Meeting with Eric #3 on MCSAD Python code.

Timing test discussion.

Get at the overall timing by reading mcsad_data.p. The pickle file mcsad_data.p saves important data from the job.
- For each situation we will test scoring/energy for these three scoring methods:
  - PLOP energy score (we can do this now)
  - GLIDE score (do when get trial/licenses)
  - DOCK score (maybe)
- Timing tests will be done for the following situations:
  - Number of chains
  - Number of mutatable residues
  - The look of this data could be a 3D plot; one axis is the number of chains, another the number of mutatable residues, and the 3D surface is the computational time. Different protein-ligand systems would likely generate different "computational time surfaces."
  - For PLOP specifically, the number of residues flagged for relaxation at each mutation site. (regarding writePLOPLists, this would correspond to the radius of the encroaching sphere.)

Then do detailed timing as discussed below.
- First contact Dr. Raj about this.
- Issues:
  - First, the program adds a new state to the state history and energy history lists only if a move is accepted. Since move acceptance depends on problem-specific features, we are more interested in the time required per trial.
  - We will eventually break the latter down into several components, e.g.:
    - side chain optimization time
    - grid generation time
    - binding affinity calculation time, etc.
  - In order to get at these values, we will use the list nTrials, which stores the number of trials per chain.
  - Also, we will make clock calls around and store those in new lists, in order to calculate timing statistics for each of the steps of move generation and the different steps of scoring.

Analysis with VMD
- Learn how to use VMD/install it/run it, etc.
- Load the current results from the r61_peptide_mcmc folder into VMD.
- Analysis
  - Look for favorable protein-protein interactions that favor certain sequences, which would not be the case with affinity scoring. Later, when testing Glide, we can compare the two.
  - Run the code with and without catalytic constraints to see the differences.

Today Eric and I:

Learned how to use Pickle. Used these commands to load mcsad_data.p:

import sys
#The mcsad_data.p seems to need a module from scorefn_protein.py
sys.path.append('/home/rajchak/Simulation/Protein/mcsapython/discretemcsa')
import pickle
import scorefn_protein
myData = pickle.load(open('mcsad_data.p','rb'))
myData.keys()
#myData['time'] is wrong though, because of using time.clock() instead of time.time().

2/18/2012
Tasks Dr. Raj gave for Brad:

1. Skype with Eric on the DOCK diagram.
2. Make a text document delineating all command-line programs required to run DOCK.
3. Protein diagram with Rakesh.
4. That's all I can remember... I don't think I forgot anything. I'll check tomorrrow on the meeting minutes I took.

1/31/2012
I added an updated Visio diagram of my discrete MCSA program to the wiki.

C++ computational chemistry code development

The following is documentation for C++ protein structure prediction and design code under development in ~rajchak/Simulation/Protein on the PMC cluster. This code could ultimately be used (with completion and modification)

for customized structure prediction and molecular design applications for which existing third party executables are not sufficient.

protein_structure_prediction_code_documentation.zip

1) The C++ code consists (or will consist) of three major parts: i) loading sequences and structures from pdb files and substituting new amino acids to alter the sequence (i.e., design) ; ii) given a sequence in cartesian coordinate representation, convert it to internal (Z matrix) coordinates, use rotamer libraries to sample different possible dihedral angles, update the internal coordinate representation accordingly, and then convert back to cartesian coordinates; iii) Discrete Monte Carlo sequence search algorithms; with MPI parallelism.

The Python code focuses on i) and iii). For ii), it relies on using precompiled executables.

2) Since the Python code uses precompiled executables, we can focus on sequence search issues and rely on these executables for giving us accurate (but slower) energy calculations.

a) Discrete MCMC code was concurrently developed for protein sequence search with binding affinity scoring functions. The plan is to test it with two executables for the scoring function: Glide, Dock. [See above for notes on the mc code.]
The use of Dock scoring in the Python code was underway.

3) The C++ code should be compiled and tested in two stages:

a) Compilation of the pdb loading (pdbload.h), sequence substitution (sequence.h) and scoring (score.h) components with a dummy scoring function will be verified. A preliminary test of its ability to rank sequences according to the dummy scoring function will be carried out.

b) Meanwhile the discrete state space MCMC code will be completed and tested in C++ using a tsp problem (5-29 UPDATE: RC has developed a templated version of discrete state space MCMC: documentation required - see Documentation Format section. It has not yet been compiled and the tsp example application has not yet been added. We may avoid TSP and move directly to protein design apps.). Once this is finished, we will compile the MCMC code together with a) and test the use of MCMC to optimize protein sequences with a dummy scoring function.

c) Concurrently, the MPI implementation of MCMC will be developed (5-29 update: VM has developed a continuous state space MPI implementation of MCMC. Discrete state space version would be developed and applied with the templated code in b above). It will first be tested with the tsp problem, and then will be compiled with a) to sample protein sequences on multiple processors with a dummy scoring function.

d) As c) is being completed, 1.ii) will continue being developed. This involves compiling the structure.h and rotamer.h functions along with the earlier parts of the code. (5-29 UPDATE: documentation required - see Documentation Format section and integrate structure.h and rotamer.h into the documentation manual). Posted KMs ppt describing these coordinate transformation algorithms above.

e) Once c) is done, we will try to emulate the Python batch server results with C++ and MPI. I.e., we will use the same precompiled executables for scoring, but the MCMC will be run in C++ on multiple processors using MPI. Speed tests will be carried out comparing the Python and C++ implementations (similar to the comparisons we did last semester with PCR codes in matlab and C++. If the C++/MPI implementation is much faster, we may use that for some practical design applications.